What are CNNs?


CNN = Neural Network with a convolution operation 

instead of matrix multiplication 
in at least one of the layers 



Neural Networks 



[Figure: input layer, hidden layer 1, hidden layer 2, output layer]


Input example: one image

Output example: one class
(airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck)


















A typical CNN architecture

[Figure: feature maps produced by alternating convolutions and subsampling, followed by a fully connected stage]



















































Preview: A ConvNet is a sequence of Convolutional Layers, interspersed with
activation functions

[Figure: stack of CONV + ReLU stages, e.g. one stage with 10 5x5x6 filters]

Biological neuron &
mathematical model

Convolution



The convolution operation 




























3 reasons why convolution is cool



Reason 1 : Sparse Connectivity 












Reason 2 : Parameter sharing 
Reason 3 : Equivariant Representations


When the input changes -> output changes in the same way


E.g. let I be a function giving image brightness at integer coordinates.
Let g be a function mapping one image function to another image function,
such that I' = g(I) is the image function with I'(x,y) = I(x - 1,y).

This shifts every pixel of I one unit to the right.

If we apply this transformation to I, then apply convolution,
the result will be the same as if we applied convolution to I,
then applied the transformation g to the output.
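
The equivariance property can be checked numerically. The following is a minimal sketch in plain Python, not taken from the slides: the helper names `shift_right` and `conv2d_valid`, the test image, and the kernel are all illustrative assumptions. It uses the sliding-window dot product without kernel flipping, and it compares the two orders of operations away from the left border, where the shift brings in zeros.

```python
def shift_right(img):
    """g: I'(x, y) = I(x - 1, y). Pixels shifted in from the left border are 0."""
    return [[0] + row[:-1] for row in img]


def conv2d_valid(img, kernel):
    """Sliding-window dot product over all positions where the kernel fits."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    return [[sum(img[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


# Hypothetical 5x6 test image and 3x3 kernel.
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(5)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

shift_then_conv = conv2d_valid(shift_right(image), kernel)
conv_then_shift = shift_right(conv2d_valid(image, kernel))

# Ignore column 0, which is affected by the zeros shifted in at the border.
interior_equal = all(a[1:] == b[1:]
                     for a, b in zip(shift_then_conv, conv_then_shift))
print(interior_equal)
```

Convolution is equivariant to translation in this sense, but not to other transformations such as rotation or scaling.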



Convolution Layers 



Convolution Layer

32x32x3 image
5x5x3 filter w

Filters always extend the full
depth of the input volume

Convolve the filter with the image
i.e. “slide over the image spatially,
computing dot products”

1 number:
the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)

convolve (slide) over all spatial locations
=> one activation map

consider a second, green filter
=> a second activation map

For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps:

[Figure: 32x32x3 image -> Convolution Layer -> 6 activation maps]

We stack these up to get a “new image” of size 28x28x6!
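
The sliding dot product described above can be sketched as a naive forward pass. This is only an illustration under stated assumptions: the function name `conv_layer`, the nested-list layout (height x width x depth), the random inputs, and the zero biases are hypothetical choices, with stride 1 and no padding.

```python
import random


def conv_layer(image, filters, biases):
    """Naive convolution layer with stride 1 and no padding.

    image:   H x W x D nested lists
    filters: K filters, each F x F x D (full input depth)
    biases:  K numbers
    Returns an (H - F + 1) x (W - F + 1) x K output volume.
    """
    H, W, D = len(image), len(image[0]), len(image[0][0])
    F = len(filters[0])
    out = []
    for r in range(H - F + 1):
        row = []
        for c in range(W - F + 1):
            pixel = []
            for w, b in zip(filters, biases):
                # F*F*D-dimensional dot product plus bias -> 1 number
                dot = sum(image[r + i][c + j][d] * w[i][j][d]
                          for i in range(F)
                          for j in range(F)
                          for d in range(D))
                pixel.append(dot + b)
            row.append(pixel)
        out.append(row)
    return out


random.seed(0)
image = [[[random.random() for _ in range(3)] for _ in range(32)]
         for _ in range(32)]
filters = [[[[random.gauss(0, 1) for _ in range(3)] for _ in range(5)]
            for _ in range(5)] for _ in range(6)]
biases = [0.0] * 6

out = conv_layer(image, filters, biases)
shape = (len(out), len(out[0]), len(out[0][0]))
print(shape)  # (28, 28, 6)
```

Each of the 6 filters contributes one 28x28 activation map, stored here as the last index of the output volume.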
Stride

A closer look at spatial dimensions:

32x32x3 image
5x5x3 filter

convolve (slide) over all spatial locations
=> activation map

7x7 input (spatially)
assume 3x3 filter
applied with stride 1
=> 5x5 output

7x7 input (spatially)
assume 3x3 filter
applied with stride 2
=> 3x3 output!

7x7 input (spatially)
assume 3x3 filter
applied with stride 3?

doesn’t fit!

cannot apply 3x3 filter on
7x7 input with stride 3.
Output size (N = input size, F = filter size):

(N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
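
The output-size formula, including the "doesn't fit" case, can be written as a small helper. This is a sketch; the function name `conv_output_size` and the choice to return `None` for a stride that does not fit are assumptions made for illustration.

```python
def conv_output_size(n, f, stride):
    """Return (N - F) / stride + 1, or None when the filter does not fit evenly."""
    if (n - f) % stride != 0:
        return None
    return (n - f) // stride + 1


sizes = {s: conv_output_size(7, 3, s) for s in (1, 2, 3)}
print(sizes)  # {1: 5, 2: 3, 3: None}
```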
Zero-Padding

Zero-padding: it is common to zero-pad the border

e.g. input 7x7

3x3 filter, applied with stride 1

pad with 1 pixel border => what is the output?

(recall:)

(N - F) / stride + 1

7x7 output!

in general, it is common to see CONV layers with
stride 1, filters of size FxF, and zero-padding with
(F-1)/2 (this preserves the spatial size)
e.g. F = 3 => zero pad with 1
     F = 5 => zero pad with 2
     F = 7 => zero pad with 3
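
To see why a padding of (F-1)/2 preserves the size, note that padding P pixels on each side turns an input of size N into one of size N + 2P. A short derivation, assuming stride 1 and an odd filter size F:

```latex
\[
\frac{(N + 2P) - F}{1} + 1
  \;=\; N + (F - 1) - F + 1
  \;=\; N
\qquad \text{for } P = \frac{F - 1}{2}.
\]
```

For the 7x7 example with F = 3 and P = 1, this gives (7 + 2 - 3)/1 + 1 = 7.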
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size: ?
(32+2*2-5)/1+1 = 32 spatially, so
32x32x10

Number of parameters in this layer?
each filter has 5*5*3 + 1 = 76 params (+1 for bias)
=> 76*10 = 760
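
Both answers from this example can be reproduced with a short helper. This is a sketch: the function name `conv_layer_shape_and_params` is hypothetical, and the argument names (input size w1 x h1 x d1, number of filters k, filter size f, stride s, padding p) are chosen here for illustration.

```python
def conv_layer_shape_and_params(w1, h1, d1, k, f, s, p):
    """Output volume shape and parameter count of a conv layer."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights = f * f * d1 * k   # each filter spans the full input depth
    biases = k                 # one bias per filter
    return (w2, h2, k), weights + biases


shape, n_params = conv_layer_shape_and_params(32, 32, 3, k=10, f=5, s=1, p=2)
print(shape, n_params)  # (32, 32, 10) 760
```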
Summary

To summarize, the Conv Layer:

• Accepts a volume of size W_1 x H_1 x D_1

• Requires four hyperparameters:

  o Number of filters K,
  o their spatial extent F,
  o the stride S,
  o the amount of zero padding P.

• Produces a volume of size W_2 x H_2 x D_2 where:

  o W_2 = (W_1 - F + 2P)/S + 1
  o H_2 = (H_1 - F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry)
  o D_2 = K

• With parameter sharing, it introduces F * F * D_1 weights per filter, for a total of (F * F * D_1) * K weights
  and K biases.

• In the output volume, the d-th depth slice (of size W_2 x H_2) is the result of performing a valid convolution
  of the d-th filter over the input volume with a stride of S, and then offset by the d-th bias.

Common settings:

K = (powers of 2, e.g. 32, 64, 128, 512)

- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
Local connectivity &
tiled convolution



Local connectivity

[Figure: locally connected layer, convolutional layer, fully connected layer]

Tiled convolution

[Figure: locally connected layer, tiled convolution, convolutional layer]







Pooling 



Pooling 
max pooling
average pooling

Effect = invariance to small translations of the input


















Pooling

[Figure: pooling stage]


Pooling

makes the representations smaller and more manageable
operates over each activation map independently

[Figure: 224x224x64 -> pool (downsampling) -> 112x112x64]

slide from: Fei-Fei Li & Andrej Karpathy & Justin Johnson


























Max Pooling

Single depth slice

max pool with 2x2 filters
and stride 2

[Figure: max pooling example on a single depth slice]
Summary

The pooling layer:

Accepts a volume of size W_1 x H_1 x D_1

Requires two hyperparameters:
  o the spatial extent F,
  o the stride S,

Produces a volume of size W_2 x H_2 x D_2 where:
  o W_2 = (W_1 - F)/S + 1
  o H_2 = (H_1 - F)/S + 1
  o D_2 = D_1

Introduces zero parameters since it computes a fixed function of the input
Note that it is not common to use zero-padding for Pooling layers

Common settings:

F = 2, S = 2
F = 3, S = 2
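
Max pooling with 2x2 filters and stride 2 on a single depth slice can be sketched as follows. The function name `max_pool` and the 4x4 input values are illustrative assumptions, not the numbers from the original figure.

```python
def max_pool(slice_2d, f=2, s=2):
    """Max pooling over one depth slice with an f x f window and stride s."""
    h, w = len(slice_2d), len(slice_2d[0])
    return [[max(slice_2d[r + i][c + j]
                 for i in range(f) for j in range(f))
             for c in range(0, w - f + 1, s)]
            for r in range(0, h - f + 1, s)]


# Hypothetical 4x4 depth slice.
depth_slice = [[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]]

pooled = max_pool(depth_slice)
print(pooled)  # [[6, 8], [3, 4]]
```

The output is 2x2, matching (4 - 2)/2 + 1 = 2, and no parameters are involved.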
Back propagation

f(x, y, z) = (x + y)z

e.g. x = -2, y = 5, z = -4

Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
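
The wanted partial derivatives of f with respect to x, y and z can be worked out with the chain rule by introducing an intermediate value q = x + y. A minimal sketch follows; the variable names `q`, `df_dq`, and so on are chosen here for illustration.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # add gate:  q = 3
f = q * z          # mul gate:  f = -12

# Backward pass (chain rule, from the output back to the inputs)
df_dq = z          # mul gate: gradient w.r.t. one input is the other input
df_dz = q
df_dx = df_dq * 1.0   # add gate: local gradient is 1 for both inputs
df_dy = df_dq * 1.0

print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```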
[Figure: activations in the computational graph]
Patterns in backward flow


add gate: gradient distributor
max gate: gradient router
mul gate: gradient... “switcher”?
Activation function



Activation Functions

Sigmoid      sigma(x) = 1/(1 + e^(-x))
tanh         tanh(x)
ReLU         max(0, x)
Leaky ReLU
Maxout       max(w_1^T x + b_1, w_2^T x + b_2)
ELU
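
For reference, the listed activation functions can be written as scalar Python functions. This is a sketch under stated assumptions: the leaky slope 0.01 and the ELU form x for x > 0, alpha * (e^x - 1) otherwise, with alpha = 1, are common default choices, and `maxout` is shown for scalar inputs with two linear pieces.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    # slope is the assumed small gradient used for negative inputs
    return max(slope * x, x)


def elu(x, alpha=1.0):
    # assumed standard ELU definition with alpha = 1
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def maxout(x, w1, b1, w2, b2):
    # scalar version: the maximum of two affine functions of x
    return max(w1 * x + b1, w2 * x + b2)


for fn in (sigmoid, tanh, relu, leaky_relu, elu):
    print(fn.__name__, [round(fn(v), 4) for v in (-5.0, 0.0, 5.0)])
```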
Activation Functions

Sigmoid

sigma(x) = 1/(1 + e^(-x))

- Squashes numbers to range [0,1]
- Historically popular since they
  have nice interpretation as a
  saturating “firing rate” of a neuron

1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
Activation Functions

tanh(x)

- Squashes numbers to range [-1,1]
- zero centered (nice)
- still kills gradients when saturated :(

[LeCun et al., 1991]






Activation Functions

ReLU
(Rectified Linear Unit)

Computes f(x) = max(0,x)

- Does not saturate (in +region)
- Very computationally efficient
- Converges much faster than
  sigmoid/tanh in practice (e.g. 6x)

- Not zero-centered output
- ReLU units can “die”

[Krizhevsky et al., 2012]
Activation Functions

Leaky ReLU

f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than
  sigmoid/tanh in practice! (e.g. 6x)
- will not “die”.

[Maas et al., 2013] [He et al., 2015]
In practice


Use ReLU. Be careful with your learning rates 

Try out Leaky ReLU / Maxout / ELU 
Try out tanh but don’t expect much 
Don’t use sigmoid 
Preprocessing data

[Figure: original data, zero-centered data, normalized data]

In practice: for images

e.g. consider CIFAR-10 example with [32,32,3] images

- Subtract the mean image (e.g. AlexNet)
  (mean image = [32,32,3] array)

- Subtract per-channel mean (e.g. VGGNet)
  (mean along each channel = 3 numbers)

Not common to normalize
variance, to do PCA or
whitening
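
The two zero-centering options differ only in what is averaged. A small sketch on a toy dataset follows; the helper names and the two tiny 2x2x3 images are hypothetical stand-ins for [32,32,3] CIFAR-10 images.

```python
def mean_image(images):
    """Per-pixel, per-channel mean over the dataset (same shape as one image)."""
    n = len(images)
    h, w, d = len(images[0]), len(images[0][0]), len(images[0][0][0])
    return [[[sum(img[r][c][k] for img in images) / n for k in range(d)]
             for c in range(w)]
            for r in range(h)]


def per_channel_mean(images):
    """One mean per channel, averaged over all pixels of all images."""
    h, w, d = len(images[0]), len(images[0][0]), len(images[0][0][0])
    count = len(images) * h * w
    return [sum(img[r][c][k] for img in images
                for r in range(h) for c in range(w)) / count
            for k in range(d)]


def subtract_channel_mean(img, means):
    return [[[v - m for v, m in zip(px, means)] for px in row] for row in img]


# Two toy 2x2x3 "images" with constant pixel values.
images = [
    [[[0.0, 10.0, 20.0]] * 2] * 2,
    [[[2.0, 30.0, 40.0]] * 2] * 2,
]

channel_means = per_channel_mean(images)       # 3 numbers
mean_img = mean_image(images)                  # a 2x2x3 array
centered = subtract_channel_mean(images[0], channel_means)
print(channel_means, centered[0][0])
```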
Weights initialization



Weights initialization


If the weights in a network start too small,
then the signal shrinks as it passes through each layer until it's too
tiny to be useful.

If the weights in a network start too large,
then the signal grows as it passes through each layer until it's too
massive to be useful.

• All zero initialization

• Small random numbers

• Draw weights from a Gaussian distribution
  with standard deviation of sqrt(2/n),
  where n is the number of inputs to the neuron
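
The Gaussian initialization with standard deviation sqrt(2/n) can be sketched as below. This assumes n is taken as the number of inputs (fan-in) of each neuron, as in He et al. (2015); the function name `he_init` and the layer sizes are illustrative.

```python
import math
import random


def he_init(n_in, n_out, rng):
    """Weight matrix (n_out x n_in) drawn from N(0, 2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
W = he_init(n_in=500, n_out=100, rng=rng)

# Empirical check: the spread of the drawn weights should be close to sqrt(2/n).
flat = [w for row in W for w in row]
empirical_std = math.sqrt(sum(w * w for w in flat) / len(flat))
target_std = math.sqrt(2.0 / 500)
print(round(target_std, 4), round(empirical_std, 4))
```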



Batch normalization 



Batch normalization 


Eases the initialization of NNs by explicitly forcing the activations throughout
the network to take on a unit Gaussian distribution at the beginning
of the training.


Normalization is a simple differentiable operation 


[Ioffe and Szegedy, 2015] 


Batch normalization 




Usually inserted after Fully 
Connected and/or Convolutional 
layers, and before nonlinearity. 
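
The normalization step itself is short. Below is a minimal training-time sketch for a single feature across one mini-batch; the learnable scale `gamma`, shift `beta`, and the small constant `eps` follow Ioffe and Szegedy (2015), while the function name and the example batch are assumptions. The running statistics used at test time are omitted.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the activations of ONE feature over a mini-batch.

    batch: list of scalar activations, one per example in the mini-batch.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(round(out_mean, 6), round(out_var, 6))
```

Every step is differentiable, so gradients can be back-propagated through the normalization.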
Batch normalization


Improves gradient flow through the network 
Allows higher learning rates 
Reduces the strong dependence on initialization 
Acts as a form of regularization in a funny way, and 
slightly reduces the need for dropout 


Thank you for your attention 


Deep Learning 



[Figure: "Deep Learning" meme with six panels: What society thinks I do, What my friends think I do, What other computer scientists think I do, What mathematicians think I do, What I think I do, What I actually do]



AlexNet example 



Case Study: AlexNet

[Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4

Q: what is the output volume size? Hint: (227-11)/4+1 = 55
Output volume [55x55x96]

Q: What is the total number of parameters in this layer?
Parameters: (11*11*3)*96 = 35K

After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27
Output volume: 27x27x96

Q: what is the number of parameters in this layer?
Parameters: 0!

After POOL1: 27x27x96
Case Study: AlexNet

[Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:

[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

[Figure: AlexNet architecture diagram]

Details/Retrospectives:

- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10
  manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
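
The spatial sizes in the simplified AlexNet listing can be re-derived layer by layer from the filter size, stride and padding. The sketch below is only a consistency check: the helper `out_size` and the tuple layout are illustrative, the normalization layers are skipped because they do not change the shape, and pooling layers are given pad 0.

```python
def out_size(n, f, s, p=0):
    """Spatial output size: (N - F + 2P) / S + 1."""
    return (n - f + 2 * p) // s + 1


# (name, number of filters or None for pooling, filter size, stride, pad)
layers = [
    ("CONV1", 96, 11, 4, 0),
    ("POOL1", None, 3, 2, 0),
    ("CONV2", 256, 5, 1, 2),
    ("POOL2", None, 3, 2, 0),
    ("CONV3", 384, 3, 1, 1),
    ("CONV4", 384, 3, 1, 1),
    ("CONV5", 256, 3, 1, 1),
    ("POOL3", None, 3, 2, 0),
]

size, depth = 227, 3
shapes = {}
for name, k, f, s, p in layers:
    size = out_size(size, f, s, p)
    if k is not None:      # conv layers set the depth; pooling keeps it
        depth = k
    shapes[name] = (size, size, depth)
    print(name, shapes[name])

conv1_weights = 11 * 11 * 3 * 96   # about 35K, biases not counted
print(conv1_weights)
```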